# load libraries
library(knitr)
library(tidyverse)
9 Tidy data
Learning Objectives
After completing this lab you should be able to
- describe what a tidy data set is.
- use pivot_longer() to convert a wide data set to a tidy (long) data set.
- use pivot_wider() to convert a long (tidy) data set to a wide data set.
For each of our modules we will have a project folder with an Rproject, *.qmd files, and sub-directories for data, scripts, and results as described in our Rproject Tutorial. You should have a directory in your Desktop or Documents folder on your laptop (name it something like bi349) as a home directory for all of our project folders this semester.
You should have already downloaded the project directory for this module. Make sure the directory is unzipped and move it into that home directory. You can open the Rproj for this module by double-clicking on it (which will launch RStudio), by opening RStudio and then using File > Open Project, or by clicking on the Rproject icon in the top right of your program window and selecting Open Project.
Once you have opened a project you should see the project name in the top right corner [1].
[1] Pro tip: If you run into issues where a quarto document won’t render or file paths aren’t working (especially if things were working previously), one of your first steps should be to double check that the correct Rproj is loaded.
There should be a document named 09_tidy-data.qmd in your project directory. Use that file to work through this tutorial; you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing, change the author in the YAML header to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set [2], and you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we work through discussion/reflection questions, but make sure that you go back and clean them up for “public consumption”.
[2] You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual. Especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.
Let’s start by loading the packages we will need for this activity.
9.1 Producing tidy data sets
The last set of functions that we need to get comfortable with allows us to create tidy data sets.
Let’s read our data set back into our R session.
# read catch data
catch <- read_delim("data/longline_catchdata.txt", delim = "\t")
Let’s quickly reformat our catch data as follows:
catch_length <- catch %>%
  unite(SetID, Year, Month, Day, Set, sep = "_") %>%
  select(SetID, Site, Species, Sex, PCL, FL, STL)
head(catch_length)
# A tibble: 6 × 7
  SetID       Site        Species       Sex     PCL    FL   STL
  <chr>       <chr>       <chr>         <chr> <dbl> <dbl> <dbl>
1 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   287   353
2 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   425   495
3 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   416   502
4 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   416   507
5 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   418   510
6 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   434   515
We can turn this into a tidy data set using pivot_longer(). To do this we have to identify the columns that will be pivoted (cols =), name the new column that will hold their former column names, i.e. the key (names_to =), and name the new column that will hold their values (values_to =).
In this case, we have made three length observations for each specimen. To have one observation per row, we want a column that identifies what type of observation was made, for example one called Measurement. This is called the “key” because it allows us to “unlock” what type of measurement each individual observation is, i.e. this column will let us know whether an observation (row) is pre-caudal length, fork length, or stretch total length.
We will designate another column, Length, to hold the values for each measurement.
We can identify the columns that need to be pivoted either by name or, since we have rearranged our data frame so they are the last columns, by column number.
tidy_length <- catch_length %>%
  pivot_longer(names_to = "Measurement", values_to = "Length", cols = 5:7)
With this data set it would be straightforward for us to, e.g., calculate mean values for each length measurement by species using group_by() and summarize().
tidy_length %>%
  group_by(Species, Measurement) %>%
  summarize(mean = mean(Length, na.rm = TRUE))
# A tibble: 42 × 3
# Groups:   Species [14]
   Species                 Measurement  mean
   <chr>                   <chr>       <dbl>
 1 Bagre_marinus           FL           433.
 2 Bagre_marinus           PCL          NaN
 3 Bagre_marinus           STL          517.
 4 Carcharhinus_brevipinna FL           644.
 5 Carcharhinus_brevipinna PCL          583.
 6 Carcharhinus_brevipinna STL          804.
 7 Carcharhinus_leucas     FL           769
 8 Carcharhinus_leucas     PCL          691.
 9 Carcharhinus_leucas     STL          936.
10 Carcharhinus_limbatus   FL           613.
# ℹ 32 more rows
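As a minimal sketch of the same reshaping step, the example below pivots a small made-up tibble and selects the measurement columns by name rather than by position. The object names toy_wide and toy_long, the SpecimenID column, and all of the values are hypothetical and only serve as an illustration; they are not part of the longline catch data.

```r
library(tidyverse)

# Hypothetical toy data: two specimens, three length measurements each
toy_wide <- tibble(
  SpecimenID = c("a", "b"),
  PCL = c(NA, 580),
  FL  = c(287, 640),
  STL = c(353, 800)
)

# Select the columns to pivot by name instead of by column number
toy_long <- toy_wide %>%
  pivot_longer(
    cols = c(PCL, FL, STL),      # columns to pivot
    names_to = "Measurement",    # new key column holding the old column names
    values_to = "Length"         # new column holding the measured values
  )

toy_long
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if the column order changes. If rows with missing values are not needed, pivot_longer() also accepts values_drop_na = TRUE to drop them while pivoting.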
9.2 Convert a tidy data set to wide format
Despite all the advantages of tidy data sets, you can see from the table above that when we present results in a table, a non-tidy (wide) layout is often easier to read.
This can be done using pivot_wider() which works like pivot_longer() but in reverse. You designate which column is the key (names_from =), i.e. these will become the column names in the new table. Then you need to identify which column in your current data frame contains the values that should be filled out/spread into the columns that will be generated from your key (values_from =).
Since precaudal length was not recorded for every species (the mean for Bagre_marinus above is NaN), we probably want to use filter() to remove those rows first.
More notes on naming things … recall that we said that filenames should not contain spaces or special characters? We set similar rules for naming objects. Well, column names pose a similar conundrum. Including spaces or special characters in a column name creates problems when we are using functions like select() to subset by column name or mutate() to create new columns based on existing columns. Similarly, if the column name is a number you will have problems. If you do have unconventional column names you can rename them using rename(), or you can put backticks on either side of the name to indicate that it is a column name.
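A short sketch of both options, using a made-up tibble: the object names awkward and clean and the column names are hypothetical and chosen only to show one name with a space and one that is a number.

```r
library(tidyverse)

# Hypothetical tibble with unconventional column names
awkward <- tibble(`fork length` = c(287, 425), `2015` = c(1, 2))

# Backticks on either side of the name let us refer to the column as it is
awkward %>%
  select(`fork length`)

# rename() takes new_name = old_name
clean <- awkward %>%
  rename(fork_length = `fork length`, year_2015 = `2015`)

names(clean)
```

Renaming once at the start of a script usually saves typing backticks in every later step.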
tidy_length %>%
  filter(Measurement != "PCL") %>%
  group_by(Species, Measurement) %>%
  summarize(mean = mean(Length, na.rm = TRUE)) %>%
  pivot_wider(names_from = "Measurement", values_from = "mean")
# A tibble: 14 × 3
# Groups:   Species [14]
   Species                       FL   STL
   <chr>                      <dbl> <dbl>
 1 Bagre_marinus               433.  517.
 2 Carcharhinus_brevipinna     644.  804.
 3 Carcharhinus_leucas         769   936.
 4 Carcharhinus_limbatus       613.  776.
 5 Carcharhinus_porosus        415   475
 6 Hypanus_americanus          NaN   954.
 7 Hypanus_sabina              NaN   349.
 8 Rhinoptera_bonasus          NaN   819
 9 Rhizoprionodon_terraenovae  412   510.
10 Sciades_felis               299.  343.
11 Sciaenops_ocellatus         793   932.
12 Sphyrna_lewini              471.  628
13 Sphyrna_tiburo              622.  792.
14 Synodus_foetens             173   185
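To see that pivot_wider() really is pivot_longer() in reverse, here is a small sketch on made-up data that widens a long tibble and then pivots it back. The object names toy_long, toy_wide and round_trip, the SpecimenID column, and the values are all hypothetical.

```r
library(tidyverse)

# Hypothetical long (tidy) data: one row per specimen and measurement
toy_long <- tibble(
  SpecimenID  = c("a", "a", "b", "b"),
  Measurement = c("FL", "STL", "FL", "STL"),
  Length      = c(287, 353, 640, 800)
)

# Wide format: one column per measurement type
toy_wide <- toy_long %>%
  pivot_wider(names_from = Measurement, values_from = Length)

# Pivoting longer again recovers the original rows
round_trip <- toy_wide %>%
  pivot_longer(cols = c(FL, STL), names_to = "Measurement", values_to = "Length")

toy_wide
round_trip
```

The round trip works here because every specimen has every measurement; with incomplete data, pivot_wider() fills the gaps with NA, and those rows would come back as explicit NA values when pivoting longer.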